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ABSTRACT 

The  projection  pursuit  methodology  is  applied  to  the  multivariate  density 
estimation  problem.  The  resulting  nonparametric  procedure  is  often  less  biased 
than  kernel  and  near  neighbor  methods  and  does  not  require  the  specification 
of  a  metric  on  the  data  measurement  space.  In  addition,  graphical  information 
is  produced  that  can  be  used  to  help  gain  geometric  insight  into  the  multi¬ 
variate  data  distribution. 
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1.  Introduction 


The  formal  goal  of  nonparametric  density  estimation  is  to  estimate 
the  probability  density  of  a  random  vector  X.  on  the  basis  of  iid  obser¬ 
vations  x^...x^  without  making  the  assumption  that  the  density  belongs 
to  a  particular  parametric  family.  Often  in  practice  a  more  important 
objective  is  to  gain  geometric  insight  into  the  data  distribution  in  Rn. 

Nonparametric  estimation  of  univariate  probability  density  functions 
has  been  extensively  studied.  Examples  of  successful  methods  are  the  re¬ 
lated  techniques  of  kernel  estimates  (Rosenblatt,  1956;  Parzen,  196?),  near 
neighbor  estimates  (Loftsgaarden  and  Quesenberry,  1965)  and  splines 
(Boneva,  Kendall,  and  Stefanov,  1971).  The  direct  extension  of  these 
methods  to  multivariate  settings,  however,  has  not  been  as  successful  in 
practice.  This  can  partly  be  attributed  to  their  deteriorating  statistical 
performance  caused  by  the  so  called  "curse  of  dimensionality"  (Bellman,  1961) 
which  requires  very  large  bandwidths  (radii  of  neighborhoods)  in  order  to 
achieve  sufficient  counts.  The  resulting  estimates  are  then  highly  biased. 

In  addition,  these  methods  do  not  provide  any  comprehensible  information 
about  the  structure  of  the  multivariate  point  cloud. 

Our  approach  to  multivariate  density  estimation  is  based  on  the  notion 
of  projection  pursuit  (Friedman  and  Tukey,  1974,  and  Friedman  and  Stuetzle, 
1981).  It  attempts  to  overcome  the  problem  of  large  bias  by  extending 
the  classical  univariate  density  estimation  methods  to  higher  dimensions  in 
a  manner  that  involves  only  univariate  estimation.  As  a  by-product 
graphical  information  is  produced  that  can  be  quite  helpful  in  exploring 
and  understanding  the  multivariate  data  distribution. 
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2.  Overview 

The  goal  of  projection  pursuit  methods  is  to  estimate  multivariate 
functions  by  combinations  of  smooth  univariate  (ridge)  functions  of  carefully 
selected  linear  combinations  of  the  variables. 

Our  projection  pursuit  density  estimation  (PPDE)  method  constructs 
estimates  of  the  form 


M 

P„(X>  ■  p0(x)  n  f  (e^.x)  x  £  k" 

m=  l 


O) 


where: 

-  pM  is  the  density  estimate  (current  model)  after  M  iterations 
of  the  procedure. 

-  pQ  is  a  given  multivariate  density  function  to  be  used  as  the 
initial  model. 

-  9^  is  a  vector  of  direction  cosines  specifying  a  direction  in 
Rn,  thus 


e 

-m 


6mixt 

1  =  1 


is  a  linear  combination  of  the  original  coordinate  measurements. 

-  f  is  a  univaraite  function, 
m 

From  (1)  PPDE  is  seen  to  approximate  the  multivariate  density  by 

an  initially  proposed  density  pQ,  multiplied  (augmented)  by  a  oroduct  of 

univariate  functions  f  of  linear  combinations  of  the  coordinates. 

m 

The  choice  of  an  initial  density  is  left  to  the  user  and  should  reflect 
his  best  initial  knowledge  of  the  data.  A  Gaussian  density  with  the  same 
mean  and  covariance  matrix  as  the  sample  is  often  a  natural  choice.  It  is 
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the  purpose  of  PPDE  to  choose  the  directions  and  construct  the  correspon¬ 
ding  functions  f  (8; x.) .  the  product  of  which  estimates  the  ratio  of  the  data 
density  to  the  initial  model  density. 

From  (1),  we  obtain  the  recursion  relation 

PM'*.)  =  — ^ 

Since  f„  is  used  to  modify  pu  ,  to  obtain  pu,  we  refer  to  the  f  as 
M  ri- 1  pi  in 

"augmenting  functions". 

This  recursive  definition  of  the  model  (2)  suggests  a  stepwise 

approach  for  its  construction.  At  the  Mth  iteration  there  is  a  current 

model  pu  ,(x)  constructed  from  the  previous  steps.  (For  the  first  step  M=1 
-  M-l  — 

the  current  model  is  the  initial  model  pQ(x)  specified  by  the  user.) 

Given  p^^x)  we  seek  a  new  model  pM(x)  (2)  to  serve  as  a  better  approxi¬ 
mation  to  the  data  density  p(x) .  Thus,  a  direction  6^  and  its  correspon¬ 
ding  augmenting  function  fg^-x)  are  chosen  to  maximize  the  goodness-of- 

— M 

fit  to  pM(xJ.  In  this  context  we  use  the  log-likelihood 
W  =  N J log[pM(x)]  p(x)dx 

as  a  measure  of  relative  goodness-of-fit.  From  (?)  we  see  that  the  likeli¬ 
hood  achieves  its  maximum  at  the  same  location  as 

w(e,f0)  *  Nyiog[fQ(e-x}]  p(x)dx  .  (3) 


This  is  to  be  maximized  under  the  constraint  that  pM(x)  be  properly 
normalized.  That  is. 


f0(e-x)dj< 


1  . 


(4) 
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For  a  given  direction  e_,  and  known  p(x.) 


fe(e*x)  =  P(e-x)  /  PM_1 (£-x)  (5; 

is  seen  to  maximize  (3)  subject  to  (4).  Here  P(e/£)  and  PM  1  (ei -_x) 
represent  the  data  and  current  model  marginal  densities  (respectively) 
along  the  (one  dimensional)  subspace  spanned  by  e_.  Using  this 

f  it  remains  to  find  a  direction  6  that  causes  (3)  to  achieve  a  maximum 
y  — 

value.  This  0*.  and  its  corresponding  augmenting  function 


VV 


x)  =  f 


(6) 


define  the  new  model  through  (2). 

In  actual  appl  ications the  data  density  p(_x)  is  unknown.  We  have  instead 
a  sample  x^,  x^,...x^  of  size  N  that  is  assumed  to  represent  N  independent 
samplings  from  p^x).  This  sample  is  used  to  estimate  the  marginal  data 
dens.ty  P(0/_x)  as  well  as  w(e_,  f0).  A  Monte  Carlo  technique  is  used  to 
estimate  the  current  model  marginal  PM  ^(0*x).  The  ratio  of  the  marginal 
density  estimates  is  used  to  estimate  fQ(£-x)  (5)  and  the  optimal  value 
that  maximizes  the  likelihood  estimate  is  determined  by  a  numerical  optimi¬ 
zation  procedure  (see  Friedman  and  Stuetzle,  1981). 

3.  Estimation  Procedures. 

We  now  discuss  the  estimation  of  the  relevant  quantities  in  (3)  -  (5). 
First  consider  the  current  model  marginal  P^  i(9/x.).  Without  loss  of 
generality  we  let  8  be  the  first  coordinate  axis,  that  is  0_-x  =  x^  Then 


M-l 


(x 


)  -  f pM-l(*>d*  dx  • • -dx 


(7) 
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If  f>M_ is  continuous  then 


fWx>)  *  J1”  2h 


1  f  V” 

Ml  PH-l^dZ 

J  Xj-h 


(3) 


k  /  "vh  s  z  s  vh>  pM-i(z)dz 


J 


f 


(9) 


where 


I(s) 


1  i f  s  is  true 
0  otherwise. 


(10) 


From  (7)  one  h?s 


pm-i(xi}  =  2¥  fl{* rh  -  y\  -  Vh)  PM-l(^)d^ 


=  1  im  -=7-  E 
h-K)  2h  p 


M-l 


[I (Xj-h 


'i  -  X1 


h)]. 


(12) 


Our  estimate  of  P^^Xj)  is  obtained  from  (12)  by  using  a  small  finite 
value  for  h  (called  the  span)  and  employing  a  Monte  Carlo  method  to 
estimate  the  expected  value.  A  Monte  Carlo  sample  ,  . . . y^  of 

size  N$  is  generated  with  density  p^jU)  and 


1 

2hN$ 


£  I(Vhs  y,j  *  Vh> 

j  =  l 


(n) 


is  taken  as  our  estimate  of  P^  j(xj).  Since  the  choice  of  Xj  as  the 
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direction  £  was  arbitrary  (13)  can  equally  well  be  written 

Ns 

Pm.1(£'2<}  =  !(£'*_  -•«  *  £'£:  *  £-x+h) 

*J-1 


(14) 


for  any  6_.  Note  that  the  same  Monte  Carlo  sample  can  be  used  for  all  6^ 
and  x.-  In  Appendix  2  we  discuss  in  detail  procedures  for  generating  a 
Monte  Carlo  sample  from  the  density  PM  ^(x.). 

By  assumption  the  data  represent  a  sample  from  p(xj  that  can  be 
used,  in  analogy  with  (14),  to  estimate  the  data  marginal  P(e-x)  as 

N 

P(e/20  =  I(!‘x  -h  <;  £-x^  s  6/x+h).  (15) 

i=l 

From  (5)  our  estimate  of  fa(0/x)  becomes 


N. 


*  .  \  i=l 

fe(e-x)  =  - fj— 


I(e/x  -  h  <  8-x_,j  s  i-x+h) 


(16) 


N  /  I(fl.x  -h  s  e-£.  <  0/x+h) 

J 

j  =  l 

This  is  just  the  ratio  of  the  fraction  of  real  counts  to  the  fraction 
of  Monte  Carlo  counts  in  an  interval  of  width  2h  centered  at  6/x.- 
In  order  to  help  stabalize  the  denominator  we  choose  h  to  always  include 
exactly  kN$  Monte  Carlo  observations.  In  this  case  (16)  becomes 


f0(o/x)  =  1(6/ Jt  -h  <  e->u  <  0-x+h)  . 


(17) 


i=l 


1 
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Using  (17)  the  log-likelihood  (3)  can  be  estimated  as 

N 

w(e,fQ)  =  ^log  ^(e-^)-  (18) 

i=l 

The  direction  eu  that  maximizes  (18)  along  with  its  corresponding  augmenting 

function  f„(eu-x)  (6)  define  the  Mth  term  of  the  PPDE  model  (1). 
n  — 

4.  Redundant  Variable  Elimination 

For  purposes  of  interpretation,  it  is  desirable  that  models  be 
parsimonious.  That  is,  models  should  involve  only  as  many  variables  as  are 
required  for  an  adequate  description  of  the  data.  For  models  constructed  by 
projection  pursuit,  this  means  that  each  solution  linear  combination  p„  should 

-r! 

involve  only  those  predictor  variables  that  are  necessary.  Due  to  sampling 
fluctuations,  it  often  happens  that  several  variables  enter  into  a  solution 
linear  combination  (usually  with  small  coefficients),  but  their  removal  will 
not  substantially  effect  the  quality  of  the  solution.  This  is  especially  true 
if  some  of  the  variables  are  highly  correlated. 

Redundant  variables  entering  into  a  solution  linear  combination  6^ 

can  be  eliminated  by  the  following  (reverse)  stepwise  procedure.  Each  non¬ 
zero  coefficient  is  in  turn  set  to  zero,  keeping  all  other  coefficients  at 
their  solution  values;  the  corresponding  augmenting  function  is  computed 
and  the  log-likelihood  is  obtained.  That  variable  xL,  for  which  this  (deleted) 
log-likelihood  w,  is  largest  becomes  a  candidate  for  elimination.  Let 

u 

xL  be  the  variable  for  which  the  deleted  log-likelihood  wL  is  smallest, 
and  w  be  the  log-1 ikel ihood  for  the  complete  solution  (no  variables 

c 

deleted) . 
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If 

wc  '  Hj  >  p(vfc  "  V  09) 

then  the  elimination  procedure  stops  and  the  complete  solution  is  accepted. 
Otherwise  x^  is  deleted  (coefficient  set  to  zero)  and  the  above  procedure 
is  repeated  (next  iterative  pass)  for  all  variables  with  non  zero  coefficients. 
This  iterative  procedure  terminates  when  the  candidate  variable  for  an 
iteration  x^  cannot  be  deleted  [(19)  true].  The  quantity  6  is  a  user 
specified  parameter  with  a  value  between  0.1  and  0.2. 

5.  Termination  Criteria 

As  with  any  stepwise  procedure,  one  needs  a  criterion  for  stopi 
the  iteration  at  some  (Mth)  step.  Stopping  too  soon  can  increase  the  oias 
of  the  estimator,  while  not  stopping  soon  enough  can  unduly  increase  its 
variance.  "Optimal"  termination  of  stepwise  procedures  has  been  studied 
(see  Stone,  1974,  and  references  therein);  these  methods  can  be  applied 
here.  In  practice,  stepwise  procedures  are  usually  terminated  subjectively, 
based  on  inspection  of  the  successive  values  of  goodnes  s  -  o  f- f  i  t 
criterion. 

PPDE  can  provide  several  additional  aids  in  judging  whether  a  new 
step  enhances  the  model  enough  to  be  included.  One  can  compare  Pu  n(ou-x) 

(the  current  model  marginal  projected  onto  e^-x)  and  P(e^-x.)  (the  actual 
data  marginal  projected  on  e^-x).  The  ratio  of  these  two  densities  would 
be  the  Mth  augmenting  function.  Since  is  chosen  to  maximize  (in  the 
likelihood  sense)  the  difference  between  the  projected  model  and  data  densities, 
their  comparison  in  this  projection  represents  a  genuine  comparison  of  the 
full  mul tivariate  densities  for  equality.  Our  experience  indicates  that 
graphical  comparisons  are  most  effective. 


Graphical  inspection  of  f^(0.*x.)  can  be  additionally  used  to  judge 
whether  it  should  be  included  in  the  model.  If  the  graph  of  f ^(_o •  x_) 
versus  e^-x.  displays  a  noisy  pattern  with  no  systematic  tendency,  then  its 
inclusion  will  likely  increase  the  variance  of  the  density  estimate.  On 
the  other  hand,  a  definite  tendency  indicates  that  f.,(bu-x)  is  dealing 

M  — M  — 

with  a  genuine  inadequacy  of  the  present  model. 

6.  Expression  of  the  Results 

from  the  formal  point  of  view,  the  result  of  applying  <  POE  is  an 
estimate  of  the  data  density  specified  by  the  initial  model,  a  series  of 
directions  (unit  vectors)  0^  €  Rn,  and  a  corresponding  set  of  augmenting 
functions  f m (0  * x_) •  The  augmenting  functions  can  be  stored  as  specific 
values  associated  with  each  observation.  Due  to  their  inherent  smoothness 
however,  this  representation  is  highly  redundant  and  a  bit  cumbersome.  For 
this  reason,  we  approximate  each  augmenting  function  by  a  cubic  spline 
function 

L 

»„<*> -£  (20) 
*=1 

The  B  (z)  are  basic  cubic  B-splines  (deBoor,  1979),  the  Bem  are  determined 
by  a  least  squares  fit  of  sm  to  the  "observations"  [e_*x. ,  fm( 9 - x_- )  ] 

(1  <  i  <  N).  The  number  of  knots  L  is  chosen  to  be  inversely  proportional 
to  the  Span  k  (17).  The  internal  knots  are  placed  such  that  equal  num¬ 
bers  of  Monte  Carlo  observations  fall  between  each  pair. 

7.  Examples 

We  illustrate  the  use  of  PPDE  by  applying  it  to  two  examples.  (A 
FORTRAN  program  implementing  the  PPDE  procedure  is  available  from  the 
authors.)  In  all  examples,  the  initial  model  pQU)  was  taken  to  be  Gaussian 
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with  the  sample  mean  and  covariance  matrix;  the  logarithm  of  the  likelihood 

« 

of  the  data  sample  under  the  initial  model  was  arbitrarily  set  to  zero; 
the  size  of  the  Monte  Carlo  sample  was  taken  to  be  twice  the  data  sample 
size. 

The  first  example  is  especially  simple  and  was  chosen  primarily  to 
facilitate  the  exposition  of  the  functioning  of  the  algorithm.  Here,  75 
observations  were  generated  in  two  dimensions  from  each  of  three  Gaussian 
distributions,  with  unit  covariance  matrix  and  centers  at  the  vertices  of 
an  equal  lateral  triangle  of  length  six  on  each  side.  Figure  la  depicts  the 
true  density  function  by  showing  a  few  of  its  isopleths  in  the  plane.  In  all, 
there  are  225  data  points  in  two  dimensions.  Figure  lb  shows  a  scatterplot 
of  the  actual  data.  Since  the  (simulated)  data  for  this  example  is  only 
two-dimensional,  it  is  possible  to  monitor  the  progress  of  the  PPDE  procedure 
as  it  attempts  to  iteratively  construct  the  two-dimensional  density  from 
one-dimensional  projections. 

Figure  2a  shows  some  isopleths  of  the  initial  model  approximation 
p  (x)  -  Gaussian  with  the  sample  mean  and  covariance  matrix.  Figure  2b 
plots  the  data  (solid  histogram)  and  a  Monte  Carlo  sample  drawn  from  pQ(xJ 
(stars)  as  projected  on  the  first  solution  linear  combination  e_  =  (0,1). 
Figure  2c  shows  a  plot  of  the  first  augmenting  function.  Figures  2b  and  2c, 
as  well  as  the  increase  of  the  log-likelihood  AW  =  79.2,  indicate  that 
the  model  after  the  first  iteration,  p^x)  =  p  (x)  fj(0.j*x),  is  a  substan¬ 
tial  improvement  over  the  initial  model.  Figure  2d  shows  some  isopleths  of 
pj(_x)  in  the  plane.  Comparison  of  Figures  2a  and  2d  verifies  this  accessment. 

Figure  3a  plots  the  data  and  a  Monte  Carlo  sample  drawn  from  p  (x)  as 
projected  on  the  second  solution  linear  combination  e_2  =  (0.85,  0.53).  Figure 
3b  shows  the  second  augmenting  function  f 2 ( e^2 * x_)  versus  e.  -x  .  The  increase 


11  - 


in  log-likelihood  AVI 2  =  108.6  and  Figures  3a  and  3b  indicate  that 
P2(x.)  =  P0(x.)  ^ 2 ^—2  — ^  ’s  a  substantial  improvement  over  Pj(x). 

This  verified  by  Figure  3c  which  shows  some  isopleths  of  p2(x).  The 
three  peak  structure  of  p(_x)  -the  true  data  density-  is  now  reproduced 
in  the  estimate  p2(x). 

Figure  4a  and  4b  show  the  data  and  Monte  Carlo  -drawn  from  P2(x)- 
projected  on  the  third  solution  linear  combination  =  (1,0),  and  the 
corresponding  augmenting  function  -f  3(_e  ■  x_) .  The  log-likelihood  improve¬ 
ment  AW ^  =  16.2,  as  well  as  Figures  4a  and  4b,  indicate  that  p3(x_)  = 
p  (x)  f 3( 0_3 - 2i)  provides  at  best  a  little  improvement  over  p?(x).  Figure 
4c  shows  isopleths  of  p  ( xj .  From  Figures  4b  and  4c,  we  see  that  p  (x) 
has  slightly  increased  the  size  of  the  peak  centered  at  (3.0,  2.5)  which 
was  a  little  underestimated  by  p2(x). 

The  small  increase  in  log-likelihood  for  the  third  iteration  as  com¬ 
pared  to  the  increases  corresponding  to  the  previous  iterations,  as  well 
as  inspection  of  Figures  4a  and  4b,  indicate  that  P3(x.)  does  not  provide 
sufficient  enhancement  of  the  density  estimate  to  warrant  its  acceptance. 
We  would  thus  terminate  the  iterative  procedure  and  take  as  our  final 
density  estimate,  p2(xj.  More  iterations  would  be  able  to  provide  further 
refinement  of  the  density  estimate  if  the  sample  size  were  large  enough 
to  control  the  additional  variance  that  would  be  introduced.  In  order  to 
exactly  reproduce  the  data  density  of  this  example,  an  infinite  number 
of  iterations  would  be  needed,  requiring  infinite  sample  size.  As  seen 
from  Figure  3c,  however,  only  two  augmenting  functions  of  these  particular 
linear  combinations  provide  a  reasonably  good  approximation. 

Table  1  compares  PPDE  to  ten  nearest  neighbor  density  estimation 
for  this  same  problem  of  three  unit  covariance  Gaussian  clusters 
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centered  at  the  vertices  of  an  equal  lateral  triangle  of  length  six  on  a 
side.  In  addition  to  the  two-dimensional  example,  we  also  consider  this 
problem  in  five  and  ten  dimensions.  The  quantity  shown  in  Table  1,  for 
both  estimates,  is  the  expected  squared  difference  between  the  true  density 
and  the  estimate,  divided  by  the  variance  of  the  true  density,  and  then 
subtracted  from  one.  A  value  of  one  for  this  quantity  means  that  the 
estimate  is  perfect,  while  a  value  near  zero  indicates  that  the  variation 
of  the  estimate  from  the  true  density  is  nearly  as  large  as  the  variation 
of  the  true  density  itself.  The  values  shown  in  Table  1  are  the  averages 
of  the  results  obtained  in  ten  Monte  Carlo  replications  of  this  example. 

Table  1  indicates  that  in  two  dimensions  their  performance  is  comparable. 
However,  as  the  dimension  of  the  measurement  space  increases,  the  performance 
of  the  ten  nearest  neighbor  estimate  degrades  much  more  rapidly  than  PPDE, 
"explaining"  only  about  ZOZ  of  the  density  variation  in  ten  dimensions. 

The  next  example  illustrates  the  use  of  PPDE  in  a  purely  data  analytic 
setting.  For  this  example,  we  use  data  from  the  diabetes  study  of  Reaven 
and  Miller  (1979).  For  each  of  145  subjects  in  the  study,  five  variables 
were  measured:  (1)  relative  weight,  (2)  a  measure  of  glucose  tolerance, 

(3)  a  second  measure  of  glucose  tolerance  -  glucose  area,  (4)  a  measure  of 
insulin  secretion  -  insulin  area,  and  (5)  a  measure  of  how  glucose  and 
insulin  interact  -  SSPG.  Variables  (2)  and  (3),  the  two  measures  of  glucose 
tolerance,  exhibit  a  high  degree  of  linear  association  (r  =  0.96)  so  that 
only  variables  (1),  (3),  (4),  and  (5)  are  considered. 

Our  purpose  with  this  example  is  to  see  how  well  the  four-dimensional 
data  density  p(xj  can  be  represented  as  a  product  of  two  two-dimensional 
marginal  densities  Pab ^ xa * xb ^  pcd^xc  *d^'  the  ddta  density  could  be 
factored  into  such  a  product  for  a  specific  pairing  of  variables,  (ab)  (cd). 


13  - 


then  all  of  the  data  sturcturing  in  the  full  four-dimensional  space  would 
be  apparent  by  viewing  the  two  scatterplots  -  variable  a  versus  variable  b, 
and  variable  c  versus  variable  d. 

Unlike  the  previous  example,  the  initial  model  is  not  explicitly  de¬ 
fined.  Here  the  initial  model  is  the  factored  approximation 


p0<i>  5  pab(Vxb>  pcd(j<b,xd>  (21) 

with  a  specific  choice  for  the  variables  a,  b,  c  and  d.  The  two-dimensional 
marginal  densities  in  (21)  are  taken  to  be  the  actual  data  as  projected 
onto  the  subspaces  spanned  by  (x  ,  x.  )  and  (x  ,  x,),  respectively.  Since  it 
is  not  our  purpose  to  provide  explicit  density  estimates,  it  is  not  neces¬ 
sary  to  have  an  explicit  (computable)  representation  for  Pq(a)-  All  that 
is  necessary  is  that  we  be  able  to  draw  a  sample  from  it.  Such  a  sample 
of  size  N  (here  N  =  145)  is  generated  by  randomly  permuting  the  observation 
labels  of  the  (x^x^)  pairs  with  respect  to  the  (x  x^)  pairs.  Let 

(r^  r2 ,  ....  r^)  be  a  random  permutation  of  the  integers  (1,2 . N).  The 

Monte  Carlo  sample  from  the  initial  model  is  taken  to  be  the  four  tuples: 


xi a  xib  Xr}c  xr jd 
X2a  x2b  xr2c  Xr?d 


xNa  xNb  x 
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As  many  Monte  Carlo  observations  as  needed  can  be  obtained  by  repeating 
this  procedure  with  different  random  permutations  ( r ,  ,  r2,  ....  r^). 

Table  2  shows  the  increase  in  log-likelihood  achieved  by  PPDE,  after 
two  and  four  iterations,  starting  with  the  four  different  initials  models 
(21)  specified  by  the  four  distinct  groupings  (a,b),  (c,d).  It  is  clear 
that  the  least  improvement  was  associated  with 

PQ(x)  =  P,  3(Xj  ,x3)  P24U2,x4),  (22) 

indicating  that  this  factorization  gives  the  best  representation  of  the 
actual  data  density.  Figure  5a  shows  a  scatterplot  of  x  versus  x3  and 
Figure  5b  shows  x2  versus  x4  for  the  data  sample.  The  results  in  Table  2 
indicate  that  this  is  the  best  pair  of  plots  to  view  the  four-dimensional 
data  structuring.  However,  starting  with  an  initial  density  equal  to  the 
product  of  the  two  (two-dimensional)  densities  shown  in  Figures  5a  and  5b, 
PPDE  was  able  to  construct  a  model  with  substantially  greater  likelihood 
(Table  2)  indicating  that  Figures  5a  and  5b  do  not  reveal  all  of  the  data 
structuring  in  the  full  four-dimensional  space. 

This  is  verified  in  Figure  5c  where  the  145  data  points  (solid)  and 
145  Monte  Carlo  points  drawn  from  p  (x.)  (22)  ("+"  signs)  are  shown  pro¬ 

jected  on  the  plane  spanned  by  the  first  two  solution  directions 
0  =  (-0.29,  -0.37,  0.38,  0.80)  and  0_2  =  (-0.08,  0.92,  -0.37,  -0.11). 

The  horizontal  axis  is  0_j -_x  and  the  vertical  axis  is 

/e  -  (e  -e  )  o  )  •  x/|o  -  (o  •£)  )  0,|.  The  data  are  seen  to  have 

\— 2  ~2  “I  -1  /  —  '-2  -2  -1  -1  1 

somewhat  the  same  shape  as  the  factored  approximation  (22)  but  to  be  more 
tightly  concentrated,  especially  in  the  circular  "ball"  centered  at  (0,0). 


As  a  formal  estimator  of  a  multivariate  density  function  PPDE  shares 
advantages  common  to  projection  pursuit  procedures.  Since  all  estimation 
is  carried  out  in  a  univariate  setting,  the  high  bias  inherent  in  other 
multivariate  nonparametric  density  estimators  can  often  be  avoided.  PPDE 
does  not  require  the  specification  of  a  metric  in  the  n-dimensional  data 
space.  Also,  the  PPDE  estimate  can  be  represented  in  a  concise  functional 
form  [(1)]  and  [(20)]. 

Bias  is  encountered  with  stepwise  procedures  when  many  terms  are  re¬ 
quired  to  provide  a  good  representation  of  the  true  data  density,  but  only 
a  few  can  be  estimated  due  to  insufficient  sample  size.  In  these  cases,  it 
is  important  that  the  first  few  terms  be  able  to  approximate  a  wide  variety 
of  functions  so  that  the  most  salient  features  of  the  data  density  can  be 
modeled.  In  the  limit  M  -*oo,  any  density  function  can  be  represented  by 
(1)  (for  any  pQ),  but  even  for  moderate  M,  functions  of  this  form  constitute 
a  rich  class.  In  addition,  the  choice  of  initial  model  pQ  permits  the  user 
to  introduce  any  knowledge  he  may  have  concerning  the  density  function, 
thereby  allowing  a  further  reduction  in  bias. 

The  success  of  PPDE  will,  of  course,  depend  on  the  particular  nature 
of  the  actual  data  density.  Examples  of  density  functions  requiring  large 
M  in  (1)  are  those  with  highly  concave  isopleths  or  spherically  nested  isopleths 
of  the  same  density  value  (unless,  of  course,  this  structure  is  anticipated 
and  incorporated  in  the  initial  model  pQ). 
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APPENDIX  1:  Back  Adjustment  of  the  Augmenting  Functions. 

The  basic  iterative  procedure  described  in  Sectior  2  is  (using  the 
language  of  linear  regression)  "stagewise"  in  that  each  and  its  cor¬ 
responding  f„  are  chosen  as  the  solution  to  an  optimization  problem  holding 
all  previous  and  (m<M)  fixed.  It  is  sometimes  possible  to  improve 
the  goodness-of-f it  of  the  model  (without  decreasing  the  number  of  degrees 
of  freedom)  by  refitting  all  fm  (m  <;  M)  after  each  f^  is  included  in  the 
model.  This  is  done  in  an  iterative  manner  similar  to  the  basic  (outer) 

iteration  procedure  except  that  the  directions  6  (1  s  m  <  M)  are  held  fixed 

-m 

to  avoid  the  (costly)  numerical  optimization.  At  each  stage  m  of  this  inner 
iterative  procedure  f  (0/x)  is  readjusted  to  maximize  the  log-likelihood 
(3)  given  all  f -(0_--x)>  j^m.  One  complete  passthrough  this  Inner  Iteration 

J  J 

produces  a  new  set  of  augmenting  functions  comprising  a  model  with  (possibly) 

higher  likelihood.  Since  this  pass  has  changed  each  f  it  is  possible 

that  yet  another  pass  can  increase  the  likelihood  still  further.  Thus,  the 

passes  themselves  are  iterated  until  no  increase  in  likelihood  is  observed. 

We  now  discuss  the  calculation  of  a  new  f  (o  -x)  given  f.(e.*x)  j/m. 

m'-m  -  3  j— j  -  J 

Let 


We  seek  a  new  function  f‘(o  -x)  that  maximizes 

m  -m  — 

w'(f;)  =  N J 1 og[f ' (e^-x)]  p(x)dx  (A2) 

subject  to  the  constraint 

f  P(m)(x)ft;(on-x)dx  -  1.  (A3) 


the  solution  is 


(A4) 


where  the  numerator  and  denominator  represent  the  corresponding  projected 

marginal  densities.  These  marginal  densities  are  estimated  in  the  manner 

described  in  Section  3.  The  resulting  estimate  for  f'  then  replaces  f 

m  m 

in  the  new  model  p^x). 
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APPENDIX  2:  Monte  Carlo  Sampling 

To  apply  PPDE,  it  is  necessary  to  draw  a  Monte  Carlo  sample  from 
the  initial  model  po(x),  respectively  from  the  current  model  PM_j(x). 

For  many  choices  of  pQ(x),  there  exist  special  algorithms  that  allow  effi¬ 
cient  direct  sampling  (see  Everett  and  Cashwell,  1973).  Densities  for 
which  this  is  not  the  case  can  be  sampled  using  the  accept/reject  method. 

Suppose  a  Monte  Carlo  sample  drawn  from  density  q(x]  is  available  and 
one  wishes  a  sample  drawn  from  p(x_).  Let 
P(x) 

Y  =  max  -  .  (A5) 

x  q(x) 

Consider  each  Monte  Carlo  observation  x^  in  turn.  For  each,  draw  a  random 
number  r.  in  the  interval  [0,1].  If  r^y  s  pfx  j/qfx^) then  the  ith 
observation  is  accepted;  otherwise  it  is  rejected.  The  accepted  Monte  Carlo 
observations  will  be  a  random  sample  drawn  from  p(x] . 

This  accept/reject  method  can  be  used  to  draw  a  random  sample  from 
any  density  p(x) .  The  efficiency  of  the  procedure  (number  accepted,  divided 
by  number  accepted  plus  number  rejected)  will  be  greater  the  closer  q(x) 
resembles  p(x.)  in  the  sense  of  low  variability  of  p(x)/q(x). 

The  form  of  pM  ^(x)  [(1)  and  (2)]  permits  reasonably  efficient  sampling 
with  the  accept/reject  method.  First  the  random  sample  drawn  from  pfl)  ^(x), 
—  available  from  the  previous  iteration—  is  considered.  From  (1)  and  (A4), 
we  have 


PN-2<i>  m=1 


<A6) 
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We  estimate  the  maximum  value  of  (A6)  by  its  largest  value  over  the  data 
sample.  Applying  the  accept/reject  procedure  to  the  pM  2  M°nte  Carlo 
sample  using  (A6),  yields  a  (smaller)  sample  drawn  from  pM  ^U).  The  re¬ 
maining  Monte  Carlo  observations  are  drawn  from  p^  ^(x)  by  applying  the 
accept/reject  procedure  to  a  sample  from  pQ(_x)  using 


PH-1 <*> 
P0(i) 


M-l 

=  n 

m=l 


<vs> 


(A7) 


Again,  the  maximum  value  of  (A7)  is  estimated  by  its  largest  value  over 
the  data  sample. 

These  maximum  value  estimates  are  clearly  biased.  The  result  of  this 
bias  is  to  introduce  a  small  (usually  negligible)  additional  bias  to  the 
resulting  density  estimates.  This  bias  contribution  can  be  eliminated  with 
the  following  procedure.  Let  y  be  the  estimated  maximum  value  for  r(x). 
Monte  Carlo  observations  y  for  which  r(y)  s  y  are  accepted  or  rejected  as 
described  above.  If  r(y)  >  y  then  the  observation  is  included  in  the  sample 
L  times  where  L  is  the  integer  part  of  r(y)/y.  The  quantity  r(y)  -  L  is 
then  used  with  the  standard  accept/reject  procedure  to  determine  if  y  is 
accepted  yet  another  time. 
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TABLE  1 

Comparison  of  PPDE  and  10  nearest  neighbor  density 
estimation  (first  example). 


1.0  - 

|e£(p-p)2]  /  Var(p)| 

Dimension 

PPDE 

10  nearest  neighbors 

2 

76 

77 

5 

75 

57 

10 

64 

22 

TABLE  2 

Increase  in  (log)  likelihood  of  PPDE  solutions  from 
factored  initial  model  pQ(x.)  =  Pab^xa,xb^  pcd^xc,xd' 

(third  example) 

Combination  Number  of  Iterations 


a 

b 

c 

d 

2 

4 

1 

2 

3 

4 

85.7 

124.1 

1 

3 

2 

4 

47.1 

76.3 

1 

4 

2 

3 

85.8 

122.1 
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FIGURE  CAPTIONS 


Figure  1: 

(a)  Isopleths  of  a  Two  Dimensional  Density 

(b)  A  Sample  of  Size  225  Drawn  From  the  Density 
Figure  2: 

(a)  Isopleths  of  Initial  Model  -  Gaussian  With  Sample  Mean  and  Covariance 

(b)  Data  (Histogram)  and  Monte  Carlo  From  Initial  Model  p  (xj  Along 

First  Solution  Linear  Combination  o  =  (0,1).  0 

(c)  First  Augmenting  Function  f  . 

(d)  Isopleths  of  p,(_x). 

Figure  3: 

(a)  Data  (Histogram)  and  Monte  Carlo  from  Pj(x)  Along  Second  Solution 
Solution  Linear  Combination  e_2  =  (0.85,  0.53). 

(b)  Second  Augmenting  Function  fo. 

(c)  Isopleths  of  p  (x) . 

2  “ 

Figure  4 


(a) 

Data  (Histogram)  and  Monte  Carlo  from  p  (x)  Along  Third  Solution 
Linear  Combination  _e  =  (1,0).  2 

(b) 

Third  Augmenting  Function  f3. 

(c) 

Isopleths  of  p3  (x) . 

Figure  5: 

(a) 

Diabetes  data:  xj  versus  x3  . 

(b) 

Diabetes  data:  x2  versus  x4  . 

(c) 

Diabetes  Data  (Solid)  and  Monte  Carlo  From  Factored  Model 
("+")  Projection  Onto  Plane  Spanned  by  First  Two  Solution 
Combinations. 

P0(x) 

Linear 
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